Abstract: We address the problem of constructing varying-coefficient models based
on basis expansions along with the technique of regularization. A crucial point in our
modeling procedure is the selection of smoothing parameters in the regularization
method. In order to choose the parameters objectively, we derive model selection
criteria from information-theoretic and Bayesian viewpoints. We demonstrate the
effectiveness of the proposed modeling strategy through Monte Carlo simulations and
the analysis of a real data set.

1 Introduction

Longitudinal data are encountered in various fields, e.g., medical research and economic
science. In a longitudinal study, the outcome is measured repeatedly over time for each
individual. Many modeling strategies have been studied for analyzing longitudinal data,
both parametric and nonparametric (see, for example, Diggle et al., 1994; Vonesh and
Chinchilli, 1997).

Within linear parametric modeling of longitudinal data, the linear mixed-effects model is
widely used in the literature (see, for example, Verbeke and Molenberghs, 2000). Its
advantage is that it easily handles unbalanced data, which occur frequently in
longitudinal studies. Meanwhile, nonparametric regression (Ruppert et al., 2003) and
functional data analysis (Ramsay and Silverman, 2005) have recently come to the fore as
nonparametric approaches to longitudinal data analysis.

They can effectively capture complex structure in longitudinal data. While these
parametric and nonparametric approaches are very useful, there are concerns about the
adequacy of the model assumptions and the potential impact of model misspecification
on the analysis, which arise especially in parametric models (Hoover et al., 1998). In
addition, it is not unusual for covariates to depend on time. Nevertheless, these
approaches do not necessarily examine how the association between covariates and a
response changes with time.

One of the most useful models for overcoming this problem is the varying-coefficient
model (VCM). Hastie and Tibshirani (1993) proposed the smoothing spline method for
estimating VCMs. The essential idea behind the VCM is that the coefficients of the
regression model are represented as functions of time. This enables us to study the
association between time-varying covariates and the outcome. Hoover et al. (1998)
presented two types of nonparametric estimation procedures for VCMs: smoothing splines
and locally weighted polynomials. They used cross-validation for selecting the smoothing
parameters in the smoothing spline method. However, cross-validation requires a large
amount of computational time and yields high variability, since the selector is applied
repeatedly.

In this paper, we introduce a nonlinear varying-coefficient modeling strategy using a
linear combination of basis functions and the regularized likelihood estimation method
for continuous longitudinal data. The tuning parameters of the proposed model are the
regularization parameters. In order to choose these parameters, we derive model
selection criteria from information-theoretic and Bayesian viewpoints. The proposed
nonlinear varying-coefficient modeling procedure is investigated through the analysis
of real data and Monte Carlo simulations.

The article is organized as follows. In Sections 2 and 3, we present the
varying-coefficient model based on basis expansions and its estimation by the maximum
penalized likelihood method. Section 4 provides model selection criteria derived from
the information-theoretic and Bayesian approaches. In Section 5 we describe Monte Carlo
simulations that examine the effectiveness of our modeling procedure, and in Section 6
we apply the proposed method to the Multicenter AIDS Cohort Study data. A summary and
discussion are given in Section 7.

2 Varying-coefficient models 

Suppose we have $p$ predictors $X_k$ ($k = 1, \ldots, p$) and a response $Y(t)$ varying
with time, and denote the observations on the $i$-th subject at time points $t_{ij}$
($j = 1, \ldots, n_i$) as $x_{ijk}$ and $y_{ij}$, respectively. Then the
varying-coefficient model has the form (Hoover et al., 1998)

$$y_{ij} = \beta_0(t_{ij}) + x_{ij1}\beta_1(t_{ij}) + \cdots + x_{ijp}\beta_p(t_{ij}) + \varepsilon_{ij}, \tag{1}$$

where $\beta_0(\cdot), \beta_1(\cdot), \ldots, \beta_p(\cdot)$ are parameter functions and
$\varepsilon_{ij}$ are random noises whose vector
$\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{in_i})'$ is normally distributed
with mean vector $0$ and variance-covariance matrix $\Sigma_i$.

We assume that the coefficient functions $\beta_0(\cdot), \beta_1(\cdot), \ldots, \beta_p(\cdot)$
are expressed by basis expansions as follows:

$$\beta_k(t_{ij}) = \sum_{m=1}^{M_k} \gamma_{km}\,\phi_{km}(t_{ij}) = \gamma_k'\phi_k(t_{ij}), \qquad k = 0, 1, \ldots, p,$$

where $\gamma_k = (\gamma_{k1}, \ldots, \gamma_{kM_k})'$ are parameters to be estimated and
$\phi_k(t_{ij}) = (\phi_{k1}(t_{ij}), \ldots, \phi_{kM_k}(t_{ij}))'$ are basis functions.
There are various kinds of basis functions, such as radial basis functions (Bishop, 1995;
Ando et al., 2008) and wavelets (Donoho and Johnstone, 1994; Fujii and Konishi, 2006).
In this paper we use B-spline bases. For details of B-splines, see de Boor (2001) and
Imoto and Konishi (2003).
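
As an illustration of the basis expansion, the following minimal sketch evaluates linear B-splines (hat functions) and forms the product $D_{ik}\Phi_{ik}$ for one subject and one predictor. The function names `linear_bspline_basis` and `design_block`, the use of NumPy, and the choice of equally spaced knots on $[0, 1]$ are assumptions made for this example; the paper does not state how the knots are placed.

```python
import numpy as np

def linear_bspline_basis(t, n_basis, lo=0.0, hi=1.0):
    """Evaluate n_basis linear B-splines (hat functions) at the points t.

    Assumes equally spaced knots on [lo, hi] and n_basis >= 2.
    Returns an array of shape (len(t), n_basis); each row sums to one on [lo, hi].
    """
    t = np.asarray(t, dtype=float)
    knots = np.linspace(lo, hi, n_basis)   # one hat function centred at each knot
    h = knots[1] - knots[0]                # knot spacing
    # A hat function equals 1 at its own knot and falls linearly to 0 at the neighbours.
    return np.clip(1.0 - np.abs(t[:, None] - knots[None, :]) / h, 0.0, None)

def design_block(x, t, n_basis):
    """Form D_ik Phi_ik: each row of the basis matrix scaled by the covariate value."""
    return np.asarray(x, dtype=float)[:, None] * linear_bspline_basis(t, n_basis)
```

With such a block for every predictor, the fitted coefficient function at time $t$ is the inner product of the basis row at $t$ with the estimated $\gamma_k$.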



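Using the above assumption and denoting $y_i = (y_{i1}, \ldots, y_{in_i})'$,
$D_{ik} = \mathrm{diag}(x_{i1k}, \ldots, x_{in_ik})$ and
$\Phi_{ik} = (\phi_k(t_{i1}), \ldots, \phi_k(t_{in_i}))'$, the varying-coefficient model (1)
is rewritten as

$$y_i = \sum_{k=0}^{p} D_{ik}\Phi_{ik}\gamma_k + \varepsilon_i, \qquad \varepsilon_i \sim N_{n_i}(0, \Sigma_i). \tag{2}$$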

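Then the varying-coefficient model has the statistical model

$$f(Y \mid \theta) = \prod_{i=1}^{n} (2\pi)^{-n_i/2}\,|\Sigma_i|^{-1/2} \exp\left\{ -\frac{1}{2}\left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\gamma_k\right)' \Sigma_i^{-1} \left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\gamma_k\right) \right\}, \tag{3}$$

where $\theta$ is a vector of unknown parameters.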

3 Estimation 

Unknown parameters, such as the coefficient vectors $\gamma_k$ and the variance-covariance
matrices $\Sigma_i$, are estimated by the maximum penalized likelihood method, that is, by
maximizing the penalized log-likelihood defined by

$$l_\lambda(\theta) = l(\theta) - \frac{n}{2}\sum_{k=1}^{p} \lambda_k\,\gamma_k'\Omega_k\gamma_k, \tag{4}$$

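where $l(\theta)$ is the log-likelihood function given by

$$\begin{aligned}
l(\theta) &= \log f(Y \mid \theta) \\
&= -\frac{1}{2}\sum_{i=1}^{n} n_i \log(2\pi) - \frac{1}{2}\sum_{i=1}^{n} \log|\Sigma_i| \\
&\quad - \frac{1}{2}\sum_{i=1}^{n} \left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\gamma_k\right)' \Sigma_i^{-1} \left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\gamma_k\right)
\end{aligned} \tag{5}$$

and $\Omega_k$ is a positive semi-definite matrix. Moreover, $\lambda_k$ are regularization
parameters which control the strength of the regularization.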

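Since it is difficult to derive the estimates of the parameters analytically, we apply
the backfitting algorithm (Hastie and Tibshirani, 1990; 1993) for maximizing
$l_\lambda(\theta)$. The first derivative of $l_\lambda(\theta)$ with respect to $\gamma_k$
is given by

$$\frac{\partial l_\lambda(\theta)}{\partial \gamma_k} = \sum_{i=1}^{n} \Phi_{ik}'D_{ik}'\Sigma_i^{-1}\left(y_i - \sum_{l \neq k} D_{il}\Phi_{il}\gamma_l\right) - \sum_{i=1}^{n} \Phi_{ik}'D_{ik}'\Sigma_i^{-1}D_{ik}\Phi_{ik}\gamma_k - n\lambda_k\Omega_k\gamma_k. \tag{6}$$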



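When the coefficient parameters other than the $k$-th are given, the backfitting
algorithm iteratively estimates the $k$-th coefficient vector by setting (6) equal to
zero, as follows:

$$\hat\gamma_k = \left(\sum_{i=1}^{n} \Phi_{ik}'D_{ik}'\Sigma_i^{-1}D_{ik}\Phi_{ik} + n\lambda_k\Omega_k\right)^{-1} \sum_{i=1}^{n} \Phi_{ik}'D_{ik}'\Sigma_i^{-1}\left(y_i - \sum_{l \neq k} D_{il}\Phi_{il}\hat\gamma_l\right), \tag{7}$$

for $k = 1, \ldots, p$. When the variance-covariance matrices $\Sigma_i$ are given in the
form $\Sigma_i = \sigma^2 S_i$ with an unknown parameter $\sigma^2$ and known matrices
$S_i$, the parameter $\sigma^2$ is estimated by

$$\hat\sigma^2 = \frac{1}{\sum_{i=1}^{n} n_i} \sum_{i=1}^{n} \left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\hat\gamma_k\right)' S_i^{-1} \left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\hat\gamma_k\right). \tag{8}$$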
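Then we have the statistical model

$$f(Y \mid \hat\theta) = \prod_{i=1}^{n} (2\pi\hat\sigma^2)^{-n_i/2}\,|S_i|^{-1/2} \exp\left\{ -\frac{1}{2\hat\sigma^2}\left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\hat\gamma_k\right)' S_i^{-1} \left(y_i - \sum_{k=0}^{p} D_{ik}\Phi_{ik}\hat\gamma_k\right) \right\}. \tag{9}$$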
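
The estimation steps of this section can be sketched in code as follows. This is a minimal illustration rather than the authors' implementation: it assumes $S_i = I$ (so that $\Sigma_i = \sigma^2 I$ and the subjects can be stacked), it cycles over all coefficient blocks, and the function name `backfit`, the starting values and the stopping rule are choices made for the example. Under $\Sigma_i = \sigma^2 I$, setting the derivative (6) to zero gives a ridge-type update with penalty weight $n\sigma^2\lambda_k$.

```python
import numpy as np

def backfit(y, Z, lam, Omega, n_subjects, n_iter=200, tol=1e-10):
    """Backfitting for y = sum_k Z[k] @ gamma[k] + noise, assuming Sigma_i = sigma^2 * I.

    y:      stacked responses, shape (N,)
    Z:      list of stacked blocks D_k Phi_k, each of shape (N, M_k)
    lam:    list of regularization parameters lambda_k
    Omega:  list of penalty matrices, each of shape (M_k, M_k)
    """
    gamma = [np.zeros(Zk.shape[1]) for Zk in Z]
    sigma2 = float(np.var(y))                      # crude starting value (assumption)
    for _ in range(n_iter):
        old = np.concatenate(gamma)
        for k, Zk in enumerate(Z):
            # Partial residual: remove the fit of every component except the k-th.
            partial = y - sum(Z[l] @ gamma[l] for l in range(len(Z)) if l != k)
            A = Zk.T @ Zk + n_subjects * sigma2 * lam[k] * Omega[k]
            gamma[k] = np.linalg.solve(A, Zk.T @ partial)
        resid = y - sum(Zk @ gk for Zk, gk in zip(Z, gamma))
        sigma2 = float(resid @ resid) / y.size     # variance update given the coefficients
        if np.max(np.abs(np.concatenate(gamma) - old)) < tol:
            break
    return gamma, sigma2
```

With all $\lambda_k = 0$ and a full-rank design, the iteration reduces to block Gauss-Seidel on the least-squares normal equations.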



4 Model selection criteria 



The varying-coefficient model estimated by the maximum penalized likelihood method
depends on the regularization parameters. Values that are too small yield estimates that
overfit the data, while values that are too large yield models that fail to capture the
structure in the data. It is therefore important to select appropriate values. We
consider using model selection criteria derived from the information-theoretic and
Bayesian approaches.

Konishi and Kitagawa (1996) derived an information criterion, GIC, for evaluating models
estimated by M-estimation, which includes the maximum penalized likelihood method. Using
this result, the GIC for evaluating the varying-coefficient model estimated by the
maximum penalized likelihood method described above is given by

$$\mathrm{GIC} = -2\,l(\hat\theta) + 2\,\mathrm{tr}\left\{R^{-1}(\hat\theta)\,Q(\hat\theta)\right\}, \tag{10}$$

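where $R(\hat\theta)$ and $Q(\hat\theta)$ are, respectively, given by

$$R(\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n} \left.\frac{\partial^2 l_\lambda^{(i)}(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta=\hat\theta}, \qquad Q(\hat\theta) = \frac{1}{n}\sum_{i=1}^{n} \left.\frac{\partial l_\lambda^{(i)}(\theta)}{\partial\theta}\,\frac{\partial l^{(i)}(\theta)}{\partial\theta'}\right|_{\theta=\hat\theta}, \tag{11}$$

where $l_\lambda^{(i)}(\theta) = l^{(i)}(\theta) - (1/2)\sum_{j=1}^{p} \lambda_j\gamma_j'\Omega_j\gamma_j$,
with $l^{(i)}(\theta)$ the log-likelihood function of the $i$-th subject. The detailed
elements of $R(\theta)$ and $Q(\theta)$ are given in the Appendix.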

Konishi et al. (2004) extended Schwarz's BIC (Schwarz, 1978) and derived a model
selection criterion, GBIC, for evaluating models estimated by the maximum penalized
likelihood method. The GBIC for evaluating the varying-coefficient model (9) is given by

$$\begin{aligned}
\mathrm{GBIC} = {} & -2\log f(Y \mid \hat\theta) + n\sum_{k=1}^{p} \lambda_k\hat\gamma_k'\Omega_k\hat\gamma_k - \left(\sum_{k=1}^{p} r_k + 1\right)\log(2\pi) \\
& + \left(\sum_{k=1}^{p} r_k + 1\right)\log n - \sum_{k=1}^{p} \log|\Omega_k| + \log|R(\hat\theta)|,
\end{aligned} \tag{12}$$

where $r_k = M_k - \mathrm{rank}(\Omega_k)$. The derivation of (12) is given in the
Appendix. We select the set of regularization parameters $\{\lambda_j\}$ that minimizes
the value of either of these criteria and regard the corresponding model as the optimal
model.
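
The selection step can be sketched as a search over candidate values. The paper does not say how the minimization over $\{\lambda_j\}$ is carried out, so the exhaustive grid search below, the function name `select_lambdas` and the callable `criterion` (which is assumed to fit the model for given regularization parameters and return its GIC or GBIC value) are illustrative assumptions only.

```python
import itertools
import math

def select_lambdas(criterion, grids):
    """Return the combination of regularization parameters minimizing `criterion`.

    criterion: callable taking a tuple (lambda_1, ..., lambda_p) and returning the
               value of the chosen model selection criterion for the fitted model.
    grids:     one sequence of candidate values per lambda_k.
    """
    best_value, best_lams = math.inf, None
    for lams in itertools.product(*grids):   # every combination of candidate values
        value = criterion(lams)
        if value < best_value:
            best_value, best_lams = value, lams
    return best_lams, best_value
```

The cost grows with the product of the grid sizes, so a coarse grid on a logarithmic scale is a common starting point when $p$ is small.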

5 Monte Carlo simulations 

We conducted Monte Carlo simulations to examine the effectiveness of our modeling
procedure. In the simulation study, we generated data sets
$\{(t_{ij}, y_{ij}, x_{ij});\ i = 1, \ldots, n,\ j = 1, \ldots, n_i\}$, where
$x_{ij} = (x_{ij1}, x_{ij2})'$, as follows. First, the time points $t_{ij}$ were generated
by $t_{ij} \sim U(0, 1)$. Second, the response $y_{ij}$ and the two predictors $x_{ij}$
were generated as

$$\begin{aligned}
y_{ij} &= f(t_{ij}) + \varepsilon_{ij}, \qquad f(t_{ij}) = x_{ij1}\beta_1(t_{ij}) + x_{ij2}\beta_2(t_{ij}), \\
x_{ij1} &= a_i\cos(\pi t_{ij}) + b_i, \qquad a_i \sim N(0, 4), \quad b_i \sim U(2, 3), \\
x_{ij2} &\in \{0, 1\}, \\
\beta_1(t_{ij}) &= \sin(\pi t_{ij}), \qquad \beta_2(t_{ij}) = t_{ij}, \\
\varepsilon_{ij} &\sim N(0, \sigma^2), \qquad \sigma = 0.05\,\{\max f(t) - \min f(t)\}.
\end{aligned}$$

We considered four sample sizes, $n = 10, 25, 50, 100$, and for each subject $i$ the
number of observations $n_i$ was generated as an integer between 8 and 15.
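
The data-generating design above can be sketched as follows. Several details are not stated in the text and are assumptions of this example: $n_i$ is drawn uniformly from $\{8, \ldots, 15\}$, $N(0, 4)$ is read as variance 4, $x_{ij2}$ is drawn as a fair coin flip for each observation, and the range of $f$ is computed over the generated sample. The function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(n, rng):
    """Generate one data set following the simulation design (see assumptions above).

    Returns a list with one dict per subject (keys: t, x, f, y) and the noise level sigma.
    """
    subjects = []
    for _ in range(n):
        n_i = int(rng.integers(8, 16))                    # integer between 8 and 15
        t = rng.uniform(0.0, 1.0, size=n_i)               # t_ij ~ U(0, 1)
        a = rng.normal(0.0, 2.0)                          # a_i ~ N(0, 4): standard deviation 2
        b = rng.uniform(2.0, 3.0)                         # b_i ~ U(2, 3)
        x1 = a * np.cos(np.pi * t) + b
        x2 = rng.integers(0, 2, size=n_i).astype(float)   # x_ij2 in {0, 1}
        f = x1 * np.sin(np.pi * t) + x2 * t               # beta_1 = sin(pi t), beta_2 = t
        subjects.append({"t": t, "x": np.column_stack([x1, x2]), "f": f})
    f_all = np.concatenate([s["f"] for s in subjects])
    sigma = 0.05 * (f_all.max() - f_all.min())            # noise tied to the range of f
    for s in subjects:
        s["y"] = s["f"] + rng.normal(0.0, sigma, size=s["f"].shape)
    return subjects, sigma
```

For example, `simulate(10, np.random.default_rng(0))` produces one replicate for the smallest sample size; repeating the call with the same generator gives further replicates.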

Based on each data set, we applied our varying-coefficient modeling procedure, using
degree-one (linear) B-splines as basis functions, taking the positive semi-definite
matrices $\Omega_k$ ($k = 1, 2$) to be identity matrices, and setting the number of basis
functions to $M_k = \max_{i=1,\ldots,n}\{n_i\}$ for $k = 1, 2$. The regularization
parameters in the penalized likelihood function were selected by the GIC or the GBIC. In
order to investigate the efficiency of the proposed modeling strategy, we compared the
GIC and the GBIC with five-fold cross-validation (CV), which is one of the most widely
used methods for smoothing parameter selection. We repeated the procedure 1000 times and
computed the average over the repetitions of the mean squared error,
$\mathrm{AMSE} = \sum_i\sum_j\{f(t_{ij}) - \hat y_{ij}\}^2 / (\sum_i n_i)$, where
$\hat y_{ij}$ is a predicted value.

Table 1 shows the results of the simulation study, together with the means of the
selected regularization parameters $\lambda_1$ and $\lambda_2$ over the 1000 trials. The
table shows that the models selected by the GIC or the GBIC are competitive with or
superior to those selected by CV in terms of AMSE. In particular, when the sample size
is small, our proposed methods seem to outperform CV.



Table 1: Comparisons of averaged mean squared errors using 1000 repetitions.

                                            GIC      GBIC     CV
n = 10     AMSE                             36.45    36.48    36.59
           $\lambda_1$ ($\times 10^{-1}$)   37.10    7.765    9.063
           $\lambda_2$ ($\times 10^{-2}$)   1.680    1.683    1.506

n = 25     AMSE                             37.52    37.53    37.55
           $\lambda_1$ ($\times 10^{-1}$)   34.74    9.574    10.49
           $\lambda_2$ ($\times 10^{-2}$)   1.077    1.078    1.084

n = 50     AMSE                             37.75    37.75    37.76
           $\lambda_1$ ($\times 10^{-1}$)   34.85    10.94    13.83
           $\lambda_2$ ($\times 10^{-2}$)   1.071    1.073    1.076

n = 100    AMSE                             38.15    38.15    38.16
           $\lambda_1$ ($\times 10^{-1}$)   37.44    12.29    20.12
           $\lambda_2$ ($\times 10^{-2}$)   1.066    1.067    1.068



6 Real data example 

We applied the proposed modeling strategy to the Multicenter AIDS Cohort Study data in
order to capture the fluctuation of the percentage of CD4 cells in the blood of
individuals infected with the human immunodeficiency virus (HIV). The data set contains
cigarette smoking status, age at HIV infection, pre-HIV infection CD4 cell percentage
and the CD4 cell percentage of each subject, observed at distinct time points after HIV
infection. Fan and Zhang (2000) analyzed these data using a functional version of ANOVA
models, and Huang et al. (2004) applied time-varying coefficient models and evaluated
the model via cross-validation.

We represent the relationship among the variables described above by the time-varying
coefficient model

$$Y_i(t) = \beta_0(t) + X_{i1}\beta_1(t) + X_{i2}\beta_2(t) + X_{i3}\beta_3(t) + \varepsilon_i(t),$$

where $X_{i1}$, $X_{i2}$, $X_{i3}$ represent centered cigarette smoking status, age at HIV
infection and pre-infection CD4 percentage of the $i$-th subject, respectively, $Y_i(t)$
denotes the CD4 percentage of the $i$-th subject observed at differing time points,
$\beta_j(t)$ ($j = 0, \ldots, 3$) are time-varying coefficients and $\varepsilon_i(t)$ is
the error function. The model was fitted by the maximum penalized likelihood method with
linear B-splines, and was then evaluated by the model selection criterion GBIC derived
in Section 4. We generated 100 bootstrap samples from the data and obtained estimates of
the coefficient functions for each of them.

Figure 1 shows the result of applying the varying-coefficient modeling. Solid lines are
the mean coefficient functions over the 100 bootstrap samples and dashed lines are
pointwise 95% confidence intervals. The results suggest the following. (1) The CD4
percentage tends to decrease with time, especially in the early period. (2) The
pre-infection CD4 percentage has a positive influence on the CD4 cell percentage, but
this influence gradually weakens with time. (3) Age and smoking status have little
influence on the CD4 percentage. These results are quite similar to those of Huang et
al. (2004). In addition, we note that the linear splines enable us to understand the
fluctuation of the coefficient functions more clearly.

Figure 1: Estimated coefficient functions of the varying coefficient model. Top left:
intercept, top right: smoking status, bottom left: age at HIV infection, bottom right:
pre-infection CD4 percent.
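
The bootstrap summary described in this section (mean curve and pointwise 95% band over 100 resamples) could be computed along the following lines. The text does not state how the samples are drawn or how the intervals are formed, so resampling whole subjects with replacement and using percentile limits are assumptions of this sketch; `bootstrap_bands` and `fit_curve` are hypothetical names, with `fit_curve` standing in for the penalized-likelihood fit of one coefficient function.

```python
import numpy as np

def bootstrap_bands(fit_curve, subjects, grid, n_boot=100, level=0.95, seed=0):
    """Pointwise bootstrap mean and percentile band for one coefficient function.

    fit_curve: callable(list_of_subjects, grid) -> estimated curve evaluated on `grid`
    subjects:  list of per-subject data; whole subjects are resampled so that the
               within-subject dependence is kept intact
    """
    rng = np.random.default_rng(seed)
    n = len(subjects)
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                  # draw n subjects with replacement
        curves[b] = fit_curve([subjects[i] for i in idx], grid)
    alpha = 1.0 - level
    lower, upper = np.quantile(curves, [alpha / 2, 1 - alpha / 2], axis=0)
    return curves.mean(axis=0), lower, upper
```

The three returned arrays correspond to the solid line and the two dashed lines of one panel.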



7 Concluding Remarks 

In this article, we have developed a varying-coefficient modeling procedure based on
basis expansions and the maximum penalized likelihood method. In order to choose values
of the regularization parameters, we have introduced model selection criteria from
information-theoretic and Bayesian viewpoints. We have applied the proposed method to
synthetic examples and to the Multicenter AIDS Cohort Study data. The results indicate
the effectiveness of our modeling strategy. Owing to the stability and predictive
performance of the constructed models, our varying-coefficient modeling strategy has the
potential to be useful in a variety of practical applications.

In future work, we will extend our model to discrete responses in longitudinal data by
using generalized linear models.



Appendix 

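A.1 Derivation of the GBIC

We consider the prior density of $\theta$ as

$$\pi(\theta \mid \lambda_1, \ldots, \lambda_p) = \prod_{k=1}^{p} (2\pi)^{-(M_k - r_k)/2}\,(n\lambda_k)^{(M_k - r_k)/2}\,|\Omega_k|^{1/2} \exp\left\{-\frac{n\lambda_k}{2}\gamma_k'\Omega_k\gamma_k\right\}, \tag{13}$$

where $r_k = M_k - \mathrm{rank}(\Omega_k)$. Then the marginal likelihood function of $Y$,
given the regularization parameters $\lambda_1, \ldots, \lambda_p$, is given by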



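$$\begin{aligned}
p(Y \mid \lambda_1, \ldots, \lambda_p) &= \int f(Y \mid \theta)\,\pi(\theta \mid \lambda_1, \ldots, \lambda_p)\,d\theta \\
&= \int \exp\left[n \times \frac{1}{n}\log\{f(Y \mid \theta)\,\pi(\theta \mid \lambda_1, \ldots, \lambda_p)\}\right] d\theta \\
&\approx \frac{(2\pi)^{d/2}}{n^{d/2}\,|R(\hat\theta)|^{1/2}} \exp\left[n \times \frac{1}{n}\log\{f(Y \mid \hat\theta)\,\pi(\hat\theta \mid \lambda_1, \ldots, \lambda_p)\}\right],
\end{aligned}$$

where $d = \sum_{k=1}^{p} M_k + 1$, and the Laplace approximation is applied. Taking minus
twice the logarithm of the marginal likelihood function, we have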



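$$\begin{aligned}
-2\log p(Y \mid \lambda_1, \ldots, \lambda_p) \approx {} & -2\log f(Y \mid \hat\theta) + n\sum_{k=1}^{p} \lambda_k\hat\gamma_k'\Omega_k\hat\gamma_k - \left(\sum_{k=1}^{p} r_k + 1\right)\log(2\pi) \\
& + \left(\sum_{k=1}^{p} r_k + 1\right)\log n - \sum_{k=1}^{p} \log|\Omega_k| + \log|R(\hat\theta)|.
\end{aligned}$$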



A.2 Elements of the matrices in the GIC

The elements of the matrices in the GIC, defined in (11), are given as follows:

where $1_n = (1, \ldots, 1)'$ is an $n$-dimensional vector and
$\Lambda_i = \mathrm{diag}\{y_{i1} - \sum_{k=0}^{p} x_{i1k}\gamma_k'\phi_k(t_{i1}), \ldots, y_{in_i} - \sum_{k=0}^{p} x_{in_ik}\gamma_k'\phi_k(t_{in_i})\}$.